Exploratory Data Analysis on Attrition

DDS Final Project

We obtained data on 870 employees at a Fortune 100 company and analyzed the factors that lead to attrition.

## Load packages
# install.packages("ggplot2")
# install.packages("GGally")
# install.packages("dplyr")
# install.packages("tidyverse")
# install.packages("e1071")
# install.packages("class")
# install.packages("readr")
# install.packages("RCurl")
# install.packages("caret")

#Load Libraries
library(readr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.3     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(e1071) 
library(class)
library(RCurl)
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
#Load data
x <- getURL("https://raw.githubusercontent.com/AlonsoSalcido/CaseStudy2DDS-/main/CaseStudy.csv")
CaseStudy <- read.csv(text = x)

Current Attrition

The first step in our exploratory data analysis is to measure and visualize current attrition in the company. 140 of the 870 employees (16%) are in the attrition group; the ratio of leavers to stayers is 19%.

EDA= CaseStudy
#Irrelevant Columns
drop= c("EmployeeCount", "EmployeeNumber", "Over18", "StandardHours", "Attrition.1")
EDA= EDA[ , !(names(EDA) %in% drop)]

#Graph Attrition
ggplot(EDA, aes(x="", fill=Attrition)) +
  geom_bar(width=1) +
  ggtitle("Current Attrition Percent")+
  coord_polar("y")

#Attrition by Numbers
AttritionYes=nrow(EDA[EDA$Attrition == "Yes",])
AttritionNo=nrow(EDA[EDA$Attrition == "No",])
AttritionPercent= AttritionYes/AttritionNo #ratio of leavers to stayers, not the share of all employees
AttritionYes
## [1] 140
AttritionNo
## [1] 730
AttritionPercent
## [1] 0.1917808
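Note that dividing the Yes count by the No count gives the odds of attrition (leavers per stayer), not the share of the workforce that left. A minimal sketch of the distinction, using the same counts as above:

```r
# Distinguish the attrition share of the workforce from the
# leaver-to-stayer odds, using the counts computed above.
yes <- 140  # employees in the attrition group
no  <- 730  # employees in the no attrition group

share <- yes / (yes + no)  # fraction of all employees who left
odds  <- yes / no          # leavers per stayer

round(share, 3)  # 0.161
round(odds, 3)   # 0.192
```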

Categorical Variables

For this step, the dataset has been broken down into two groups: Yes Attrition and No Attrition.

We’re looking for differences (in terms of percentage composition) between the two groups. If a variable has no impact on attrition, it should show a similar percentage breakdown in both groups. If a substantial difference appears between the groups, the variable may have an impact on attrition.

Travel

No substantial difference is seen in travel. The biggest difference in this category is among those who travel frequently: they make up a larger percentage of the Attrition group than of the No Attrition group.

#Travel
ggplot(EDA,mapping=aes(x=Attrition,fill=BusinessTravel))+
  geom_bar(position="fill")+
  ggtitle("Attrition by Travel")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())
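The percentages the filled bar charts display can also be read off numerically. A minimal base-R sketch on toy data (hypothetical values standing in for the EDA data frame):

```r
# Toy data standing in for the EDA data frame (hypothetical values).
toy <- data.frame(
  Attrition      = c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
  BusinessTravel = c("Travel_Frequently", "Travel_Rarely", "Travel_Rarely",
                     "Travel_Rarely", "Travel_Rarely", "Travel_Rarely",
                     "Non-Travel", "Travel_Frequently")
)

# prop.table with margin = 1 normalizes each Attrition row to sum to 1,
# i.e. the within-group composition shown by geom_bar(position = "fill").
round(100 * prop.table(table(toy$Attrition, toy$BusinessTravel), margin = 1), 1)
```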

## Department

When we looked at differences between departments, we noticed that the Sales department shows a substantial difference between the two groups: it tends to have a higher attrition rate.

Research and Development has the lowest attrition.

ggplot(EDA,mapping=aes(x=Attrition,fill=Department))+
  geom_bar(position="fill")+
  ggtitle("Attrition by Department")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())

## EducationField

In terms of education field, those with a technical degree tend to have a higher attrition rate than the rest.

ggplot(EDA,mapping=aes(x=Attrition,fill=EducationField))+
  geom_bar(position="fill")+
  ggtitle("Attrition by EducationField")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())

## Gender

After visualizing gender, the two groups show nearly identical breakdowns, so we can discard it as a factor in attrition.

ggplot(EDA,mapping=aes(x=Attrition,fill=Gender))+
  geom_bar(position="fill")+
  ggtitle("Attrition by Gender")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())

## JobRole

Job Role is an interesting variable since it has multiple categories. The job role we will focus on is Sales Representative, since it shows a substantial difference between the attrition and no attrition groups.

Sales Representatives tend to be more prone to attrition than any other job role in the company.

ggplot(EDA,mapping=aes(x=Attrition,fill=JobRole))+
  geom_bar(position="fill")+
  ggtitle("Attrition by JobRole")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())

## MaritalStatus

Marital Status can be an important factor in attrition. Employees who are single make up a substantially bigger portion of the attrition group than of the no attrition group.

Conversely, divorced employees tend to be less likely to voluntarily resign.

ggplot(EDA,mapping=aes(x=Attrition,fill=MaritalStatus))+
  geom_bar(position="fill")+
  ggtitle("Attrition by MaritalStatus")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())

## OverTime

Over Time is one of the variables with the largest difference between the attrition and no attrition groups.

Of the employees in the attrition group, 57% worked overtime, while only 24% of the no attrition group did. This is a group we need to keep an eye on.

ggplot(EDA,mapping=aes(x=Attrition,fill=OverTime))+
  geom_bar(position="fill")+
  ggtitle("Attrition by OverTime")+
  ylab("Percentage")+
  scale_y_continuous(labels = scales::percent_format())

nrow(EDA[EDA$Attrition == "Yes" & EDA$OverTime == "Yes",])/nrow(EDA[EDA$Attrition == "Yes",])
## [1] 0.5714286
nrow(EDA[EDA$Attrition == "No" & EDA$OverTime == "Yes",])/nrow(EDA[EDA$Attrition == "No",])
## [1] 0.2356164

Ordinal Variables Transformation

Make Attrition Numeric

NEDA = EDA[,c(3, 1:32)] #move Attrition to the front (note: column 3 is duplicated; the copy surfaces as Attrition.1)
NEDA$Attrition=gsub('No', 0 , NEDA$Attrition)
NEDA$Attrition=gsub('Yes', 1 , NEDA$Attrition)
NEDA$Attrition= as.numeric(NEDA$Attrition) #Numeric
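An equivalent, slightly safer recoding: gsub() replaces substrings, so any value merely containing "No" would also be rewritten, while an exact-match test avoids that. A minimal sketch:

```r
# Recode a Yes/No character vector to 1/0 by exact match rather than
# substring replacement with gsub().
att <- c("Yes", "No", "No", "Yes")
num <- ifelse(att == "Yes", 1L, 0L)
num  # 1 0 0 1
```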

## Ordinal Variables Trends

We see an inverse relationship with attrition for most ordinal variables at the lower levels.

#Environment Satisfaction Visualized

NEDA %>% ggplot(aes(EnvironmentSatisfaction, Attrition)) + geom_smooth()+
  ggtitle("Attrition and EnvironmentSatisfaction")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.5121e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 4.015
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.5121e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1

## Job Involvement

An inverse relationship between Job Involvement and attrition can be seen.

NEDA %>% ggplot(aes(JobInvolvement, Attrition)) + geom_smooth()+
  ggtitle("Attrition and JobInvolvement")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## (repeated loess near-singularity warnings omitted)

## JobLevel

An inverse relationship with attrition is seen at the lower job levels.

NEDA %>% ggplot(aes(JobLevel, Attrition)) + geom_smooth()+
  ggtitle("Attrition and JobLevel")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## (repeated loess near-singularity warnings omitted)

JobSatisfaction

There’s an inverse relationship between Job Satisfaction and attrition.

NEDA %>% ggplot(aes(JobSatisfaction, Attrition)) + geom_smooth()+
  ggtitle("Attrition and JobSatisfaction")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## (repeated loess near-singularity warnings omitted)

## Stock Option Level

The inverse relationship between stock option level and attrition appears only at the lower levels.

NEDA %>% ggplot(aes(StockOptionLevel, Attrition)) + geom_smooth()+
  ggtitle("Attrition and Stock Option Level")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## (repeated loess near-singularity warnings omitted)

## WorkLifeBalance

Work-life balance follows the same trend as the other ordinal variables: the change is mostly present at the lower levels.

NEDA %>% ggplot(aes(WorkLifeBalance, Attrition)) + geom_smooth()+
  ggtitle("Attrition and WorkLifeBalance")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## (repeated loess near-singularity warnings omitted)

## Continuous Variables

We continue visualizing our dataset to better understand its variables. In this case we use box plots to look for differences between the two groups.

## Age

This graph shows that the attrition group has a lower mean and median age than the no attrition group.

ggplot(EDA, aes(x=Attrition, y=Age, fill=Attrition)) + 
  geom_boxplot(notch=TRUE) +
  stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
  ggtitle("Attrition by Age") 

## DistanceFromHome

Distance from home shows that those who quit tend to live farther from their workplace.

ggplot(EDA, aes(x=Attrition, y=DistanceFromHome, fill=Attrition)) + 
  geom_boxplot(notch=TRUE) +
  stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
  ggtitle("Attrition by DistanceFromHome") 

## MonthlyIncome

Monthly Income shows that the attrition group tends to have a lower monthly income than the no attrition group.

ggplot(EDA, aes(x=Attrition, y=MonthlyIncome, fill=Attrition)) + 
  geom_boxplot(notch=TRUE) +
  stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
  ggtitle("Attrition by MonthlyIncome") 

## YearsInCurrentRole

Employees in the attrition group tend to have spent less time in their current role than those who stay.

ggplot(EDA, aes(x=Attrition, y=YearsInCurrentRole, fill=Attrition)) + 
  geom_boxplot(notch=TRUE) +
  stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
  ggtitle("Attrition by YearsInCurrentRole") 

## Predict Attrition

After visualizing our attrition factors and gaining a better understanding of our dataset, we will attempt to predict attrition using two different predictive models.

Tidying up our Data for Prediction

In order to properly use our different kinds of variables to predict attrition we need to clean up and transform our data.

KNNEDA = EDA

#Drop Non Relevant Variables
drop= c("ID", "DailyRate", "Education", "Gender", "HourlyRate")
KNNEDA= KNNEDA[ , !(names(KNNEDA) %in% drop)]
KNNEDA$Attrition=factor(KNNEDA$Attrition)

#Transform Categorical Variables
KNNEDA$Non_Travel= ifelse(KNNEDA$BusinessTravel == "Non-Travel",1,0)
KNNEDA$Travel_Frequently= ifelse(KNNEDA$BusinessTravel == "Travel_Frequently",1,0) 
KNNEDA$Travel_Rarely= ifelse(KNNEDA$BusinessTravel == "Travel_Rarely",1,0) 

KNNEDA$HRDepartment= ifelse(KNNEDA$Department == "Human Resources",1,0)
KNNEDA$RDDepartment= ifelse(KNNEDA$Department == "Research & Development",1,0)
KNNEDA$SalesDepartment= ifelse(KNNEDA$Department == "Sales",1,0)

KNNEDA$HumanResourcesDegree= ifelse(KNNEDA$EducationField == "Human Resources",1,0)
KNNEDA$LifeSciencesDegree= ifelse(KNNEDA$EducationField == "Life Sciences",1,0)
KNNEDA$MarketingDegree= ifelse(KNNEDA$EducationField == "Marketing",1,0)
KNNEDA$MedicalDegree= ifelse(KNNEDA$EducationField == "Medical",1,0)
KNNEDA$OtherDegree= ifelse(KNNEDA$EducationField == "Other",1,0)
KNNEDA$TechnicalDegree= ifelse(KNNEDA$EducationField == "Technical Degree",1,0)

KNNEDA$HealthcareRep= ifelse(KNNEDA$JobRole == "Healthcare Representative",1,0)
KNNEDA$HumanResources= ifelse(KNNEDA$JobRole == "Human Resources",1,0)
KNNEDA$LaboratoryTechnician= ifelse(KNNEDA$JobRole == "Laboratory Technician",1,0)
KNNEDA$Manager= ifelse(KNNEDA$JobRole == "Manager",1,0)
KNNEDA$ManufacturingDirector= ifelse(KNNEDA$JobRole == "Manufacturing Director",1,0)
KNNEDA$ResearchDirector= ifelse(KNNEDA$JobRole == "Research Director",1,0)
KNNEDA$ResearchScientist= ifelse(KNNEDA$JobRole == "Research Scientist",1,0)
KNNEDA$SalesExec= ifelse(KNNEDA$JobRole == "Sales Executive",1,0)
KNNEDA$SalesRep= ifelse(KNNEDA$JobRole == "Sales Representative",1,0)

KNNEDA$Divorced= ifelse(KNNEDA$MaritalStatus == "Divorced",1,0)
KNNEDA$Married= ifelse(KNNEDA$MaritalStatus == "Married",1,0)
KNNEDA$Single= ifelse(KNNEDA$MaritalStatus == "Single",1,0)

KNNEDA$OverTime= ifelse(KNNEDA$OverTime == "Yes",1,0)

#Drop Transformed Variables
drop= c("BusinessTravel", "Department", "EducationField", "Gender", "JobRole", "MaritalStatus")
KNNEDA= KNNEDA[ , !(names(KNNEDA) %in% drop)]
KNNEDA= KNNEDA[,c(2, 1:46)] #move Attrition to the front (this duplicates column 2)
KNNEDA= KNNEDA[ , !(names(KNNEDA)=="Attrition.1")] #drop the duplicated Attrition column
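The block of manual ifelse() indicators above can also be generated in a single call with model.matrix(). A minimal sketch on toy data (column names will differ from the hand-built ones):

```r
# One-hot encode a factor column with model.matrix(); the "- 1" drops
# the intercept so every level gets its own 0/1 indicator column.
toy <- data.frame(MaritalStatus = factor(c("Single", "Married", "Divorced", "Single")))
dummies <- model.matrix(~ MaritalStatus - 1, data = toy)
colnames(dummies)  # "MaritalStatusDivorced" "MaritalStatusMarried" "MaritalStatusSingle"
dummies[1, ]       # Divorced = 0, Married = 0, Single = 1
```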

Top 3 Factors of Attrition

Once our data is ready we can determine the top 3 factors for attrition. Running a generalized linear model and ranking factors by p-value, the top 3 factors are: OverTime, JobInvolvement, and JobSatisfaction.

model=glm(Attrition~.,data = KNNEDA, family= "binomial")
summary(model)
## 
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = KNNEDA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7863  -0.4455  -0.2171  -0.0632   3.2645  
## 
## Coefficients: (5 not defined because of singularities)
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.310e+00  1.782e+00   4.103 4.07e-05 ***
## Age                      -3.033e-02  1.842e-02  -1.647 0.099654 .  
## DistanceFromHome          5.075e-02  1.494e-02   3.398 0.000680 ***
## EnvironmentSatisfaction  -3.478e-01  1.152e-01  -3.018 0.002541 ** 
## JobInvolvement           -7.769e-01  1.701e-01  -4.568 4.92e-06 ***
## JobLevel                 -8.053e-02  4.469e-01  -0.180 0.857015    
## JobSatisfaction          -5.123e-01  1.129e-01  -4.538 5.67e-06 ***
## MonthlyIncome             9.571e-05  1.191e-04   0.804 0.421517    
## MonthlyRate              -1.518e-05  1.771e-05  -0.857 0.391324    
## NumCompaniesWorked        2.377e-01  5.444e-02   4.366 1.27e-05 ***
## OverTime                  1.971e+00  2.562e-01   7.691 1.46e-14 ***
## PercentSalaryHike        -1.829e-02  5.296e-02  -0.345 0.729837    
## PerformanceRating         3.508e-01  5.535e-01   0.634 0.526240    
## RelationshipSatisfaction -2.468e-01  1.087e-01  -2.271 0.023139 *  
## StockOptionLevel         -7.907e-02  2.078e-01  -0.381 0.703571    
## TotalWorkingYears        -1.017e-01  4.168e-02  -2.439 0.014723 *  
## TrainingTimesLastYear    -2.821e-01  1.037e-01  -2.721 0.006507 ** 
## WorkLifeBalance          -5.645e-01  1.687e-01  -3.347 0.000817 ***
## YearsAtCompany            8.341e-02  5.268e-02   1.583 0.113355    
## YearsInCurrentRole       -1.318e-01  6.027e-02  -2.187 0.028777 *  
## YearsSinceLastPromotion   2.484e-01  5.938e-02   4.184 2.87e-05 ***
## YearsWithCurrManager     -1.607e-01  6.314e-02  -2.544 0.010950 *  
## Non_Travel               -8.333e-01  4.518e-01  -1.844 0.065114 .  
## Travel_Frequently         7.097e-01  3.170e-01   2.239 0.025169 *  
## Travel_Rarely                    NA         NA      NA       NA    
## HRDepartment             -1.395e+01  7.601e+02  -0.018 0.985359    
## RDDepartment              1.414e-01  1.257e+00   0.112 0.910435    
## SalesDepartment                  NA         NA      NA       NA    
## HumanResourcesDegree      9.643e-01  1.252e+00   0.770 0.441243    
## LifeSciencesDegree       -6.154e-01  4.295e-01  -1.433 0.151927    
## MarketingDegree          -5.060e-01  5.508e-01  -0.919 0.358258    
## MedicalDegree            -6.369e-01  4.406e-01  -1.445 0.148352    
## OtherDegree              -3.328e-01  6.131e-01  -0.543 0.587200    
## TechnicalDegree                  NA         NA      NA       NA    
## HealthcareRep            -2.170e+00  1.438e+00  -1.509 0.131232    
## HumanResources            1.275e+01  7.601e+02   0.017 0.986619    
## LaboratoryTechnician     -1.163e+00  1.330e+00  -0.874 0.381960    
## Manager                  -2.164e+00  1.390e+00  -1.557 0.119469    
## ManufacturingDirector    -3.828e+00  1.574e+00  -2.432 0.014999 *  
## ResearchDirector         -4.359e+00  1.965e+00  -2.219 0.026488 *  
## ResearchScientist        -1.832e+00  1.334e+00  -1.373 0.169825    
## SalesExec                -1.469e+00  5.521e-01  -2.660 0.007812 ** 
## SalesRep                         NA         NA      NA       NA    
## Divorced                 -1.776e+00  5.109e-01  -3.475 0.000511 ***
## Married                  -8.321e-01  3.295e-01  -2.526 0.011548 *  
## Single                           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 767.67  on 869  degrees of freedom
## Residual deviance: 472.49  on 829  degrees of freedom
## AIC: 554.49
## 
## Number of Fisher Scoring iterations: 15
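The ranking quoted above can be pulled straight from the model summary by sorting the coefficient p-values. A sketch on a toy logistic model so it runs standalone (the variable names here are hypothetical, not the project's columns):

```r
# Rank predictors of a logistic regression by p-value, as done to pick
# the top attrition factors. Toy data: only x1 actually drives y.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1))
toy <- data.frame(y, x1, x2, x3)

fit   <- glm(y ~ x1 + x2 + x3, data = toy, family = "binomial")
pvals <- coef(summary(fit))[, "Pr(>|z|)"]

# Drop the intercept and sort ascending: smallest p-values first.
head(sort(pvals[names(pvals) != "(Intercept)"]), 3)
```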

KNN

The first model used to predict attrition is KNN. Using leave-one-out cross-validation, KNN gave us about 81% accuracy, with high sensitivity but low specificity.

KNNEDA1= KNNEDA[, c(1,3,4,6, 11, 15,18, 23, 28, 34, 39, 43,46)]
set.seed(1)
confusionMatrix(table(knn.cv(KNNEDA1[,2:12],KNNEDA1$Attrition, k = 1), KNNEDA1$Attrition))
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  662 101
##   Yes  68  39
##                                           
##                Accuracy : 0.8057          
##                  95% CI : (0.7779, 0.8315)
##     No Information Rate : 0.8391          
##     P-Value [Acc > NIR] : 0.99611         
##                                           
##                   Kappa : 0.2049          
##                                           
##  Mcnemar's Test P-Value : 0.01383         
##                                           
##             Sensitivity : 0.9068          
##             Specificity : 0.2786          
##          Pos Pred Value : 0.8676          
##          Neg Pred Value : 0.3645          
##              Prevalence : 0.8391          
##          Detection Rate : 0.7609          
##    Detection Prevalence : 0.8770          
##       Balanced Accuracy : 0.5927          
##                                           
##        'Positive' Class : No              
## 
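The low specificity above may partly reflect that KNN distances on raw columns are dominated by large-scale features such as MonthlyIncome. A common refinement is to standardize the predictors first; a sketch on toy data (two hypothetical features on very different scales, not the project's columns):

```r
library(class)  # knn.cv, as used above

# Standardize predictors before KNN so no single large-scale feature
# dominates the Euclidean distance.
set.seed(1)
x <- data.frame(income       = rnorm(100, 5000, 1000),
                satisfaction = rnorm(100, 3, 1))
y <- factor(rbinom(100, 1, 0.5), labels = c("No", "Yes"))

xs   <- scale(x)               # center to mean 0, sd 1
pred <- knn.cv(xs, y, k = 5)   # leave-one-out CV predictions
mean(pred == y)                # CV accuracy
```

With random labels the accuracy here is uninformative; the point is the scale() step before the distance computation.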

Attrition-Naive Bayes

Our last model, Naive Bayes, gave us the best results: 86% accuracy, 91% sensitivity, and 58% specificity.

set.seed(1)
trainIndices = sample(1:dim(KNNEDA)[1],round(0.7 * dim(KNNEDA)[1]))
train = KNNEDA[trainIndices,]
test = KNNEDA[-trainIndices,]

model_train = naiveBayes(Attrition~.,data = train)
predict(model_train, train[,c(4,5,6,8,10,11,15,18,19,24,43,46)])
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'Age'. Did you use factors
## with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'DistanceFromHome'. Did you
## use factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'JobSatisfaction'. Did you
## use factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'MonthlyRate'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'PercentSalaryHike'. Did you
## use factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'PerformanceRating'. Did you
## use factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'RelationshipSatisfaction'.
## Did you use factors with numeric labels for training, and numeric values for new
## data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'TotalWorkingYears'. Did you
## use factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'TrainingTimesLastYear'.
## Did you use factors with numeric labels for training, and numeric values for new
## data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'YearsInCurrentRole'. Did
## you use factors with numeric labels for training, and numeric values for new
## data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'YearsSinceLastPromotion'.
## Did you use factors with numeric labels for training, and numeric values for new
## data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'YearsWithCurrManager'. Did
## you use factors with numeric labels for training, and numeric values for new
## data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'Non_Travel'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'Travel_Rarely'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'HRDepartment'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, :
## Type mismatch between training and new data for the one-hot indicator
## variables (department, degree field, job role, and marital status). Did you
## use factors with numeric labels for training, and numeric values for new
## data? [same warning repeated once per variable]
##   [1] No  No  Yes Yes No  Yes No  No  No  No  No  No  No  No  No  No  No  No 
##   ... [609 predicted Attrition labels in total; remainder omitted]
## Levels: No Yes
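The type-mismatch warnings above occur because `naiveBayes()` saw the one-hot dummy columns as one type while the data frame passed to `predict()` stores them as numeric 0/1. A minimal sketch of one way to silence them, converting the dummies to factors before fitting (the column names are taken from the warnings; the vector would need to be extended to every indicator column):

```r
# Convert the numeric 0/1 dummy columns to factors so the types seen by
# naiveBayes() and predict() agree and no mismatch warnings are emitted.
dummy_cols <- c("HRDepartment", "RDDepartment", "SalesDepartment",
                "Divorced", "Married")  # extend to all indicator columns
train <- train %>% mutate(across(all_of(dummy_cols), as.factor))
model_train <- naiveBayes(Attrition ~ ., data = train)
```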
# Fit a Naive Bayes model on the test split and classify the selected columns
model_test = naiveBayes(Attrition ~ ., data = test)
classifications = predict(model_test, test[, c(4, 5, 6, 8, 10, 11, 15, 18, 19, 24, 43, 46)])
## Warning in predict.naiveBayes(model_test, test[, c(4, 5, 6, 8, 10, 11, 15, :
## Type mismatch between training and new data for the numeric and one-hot
## indicator variables. Did you use factors with numeric labels for training,
## and numeric values for new data? [same warning repeated once per variable]
CM_NaiveBayes = confusionMatrix(table(classifications, test$Attrition))
CM_NaiveBayes
## Confusion Matrix and Statistics
## 
##                
## classifications  No Yes
##             No  201  18
##             Yes  19  23
##                                           
##                Accuracy : 0.8582          
##                  95% CI : (0.8099, 0.8982)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 0.2801          
##                                           
##                   Kappa : 0.4699          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.9136          
##             Specificity : 0.5610          
##          Pos Pred Value : 0.9178          
##          Neg Pred Value : 0.5476          
##              Prevalence : 0.8429          
##          Detection Rate : 0.7701          
##    Detection Prevalence : 0.8391          
##       Balanced Accuracy : 0.7373          
##                                           
##        'Positive' Class : No              
## 
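The headline statistics above can be reproduced directly from the 2×2 table. A quick sanity check, with the counts hard-coded from the matrix above and `No` as the positive class:

```r
# Hand-compute the statistics that caret reports from the confusion matrix.
tab <- matrix(c(201, 19, 18, 23), nrow = 2,
              dimnames = list(Predicted = c("No", "Yes"),
                              Actual    = c("No", "Yes")))
accuracy    <- sum(diag(tab)) / sum(tab)             # (201 + 23) / 261 ~ 0.858
sensitivity <- tab["No", "No"] / sum(tab[, "No"])    # 201 / 220 ~ 0.914
specificity <- tab["Yes", "Yes"] / sum(tab[, "Yes"]) # 23 / 41 ~ 0.561
```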

Interesting Finding and Final Recommendations

In our analysis, we found it interesting that for ordinal variables, the lowest level shows the largest difference between employees who stay and those who leave.

In general, the Attrition group contains proportionally more employees at the lowest level of each variable; we do not see the same difference at the higher levels.

Our recommendation is to focus on employees who have no stock options or who rank at the lowest level of Environment Satisfaction, Job Involvement, Job Level, Job Satisfaction, or Work-Life Balance.

If we can move those employees just one level higher, we can make a measurable impact on attrition.
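This pattern can also be checked numerically. A quick sketch with dplyr, using EnvironmentSatisfaction as a representative variable (the same approach applies to the other ordinal variables; it assumes Attrition is coded "Yes"/"No" as in the source CSV):

```r
# Attrition rate at each level of an ordinal variable; the pattern described
# above would appear as a noticeably higher rate at the lowest level.
EDA %>%
  group_by(EnvironmentSatisfaction) %>%
  summarise(AttritionRate = mean(Attrition == "Yes"), n = n())
```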

Below are some graphs visualizing those differences.

ggplot(EDA, aes(x = Attrition, fill = factor(EnvironmentSatisfaction))) +
  geom_bar(position = "fill") +
  ggtitle("Attrition by EnvironmentSatisfaction") +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())

ggplot(EDA, aes(x = Attrition, fill = factor(JobInvolvement))) +
  geom_bar(position = "fill") +
  ggtitle("Attrition by JobInvolvement") +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())

ggplot(EDA, aes(x = Attrition, fill = factor(JobLevel))) +
  geom_bar(position = "fill") +
  ggtitle("Attrition by JobLevel") +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())

ggplot(EDA, aes(x = Attrition, fill = factor(JobSatisfaction))) +
  geom_bar(position = "fill") +
  ggtitle("Attrition by JobSatisfaction") +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())

ggplot(EDA, aes(x = Attrition, fill = factor(StockOptionLevel))) +
  geom_bar(position = "fill") +
  ggtitle("Attrition by StockOptionLevel") +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())

ggplot(EDA, aes(x = Attrition, fill = factor(WorkLifeBalance))) +
  geom_bar(position = "fill") +
  ggtitle("Attrition by WorkLifeBalance") +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())
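The six plots above repeat the same stacked-bar pattern. As an alternative, they could be combined into a single faceted figure; a sketch using tidyr's `pivot_longer()` on the same `EDA` data frame:

```r
# Reshape the ordinal variables to long format and draw one stacked-bar
# panel per variable instead of repeating the ggplot call six times.
EDA %>%
  select(Attrition, EnvironmentSatisfaction, JobInvolvement, JobLevel,
         JobSatisfaction, StockOptionLevel, WorkLifeBalance) %>%
  pivot_longer(-Attrition, names_to = "Variable", values_to = "Level") %>%
  ggplot(aes(x = Attrition, fill = factor(Level))) +
  geom_bar(position = "fill") +
  facet_wrap(~Variable) +
  ylab("Percentage") +
  scale_y_continuous(labels = scales::percent_format())
```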